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Abstract 

We present a new computational and statistical approach for fitting isotonic models under 
convex differentiable loss functions through recursive partitioning. Models along the parti- 
tioning path are also isotonic and can be viewed as regularized solutions to the problem. Our 
approach generalizes and subsumes the well-known work of Barlow and Brunk (1972) on fit- 
ting isotonic regressions subject to specially structured loss functions, and expands the range 
of loss functions that can be used (for example, adding Huber's loss for robust regression). 
This is accomplished through an algorithmic adjustment to the recursive partitioning approach 
recently developed for solving large scale /2-loss isotonic regression problems (Spouge et al. 
2003, Luss et al. 2011). We prove that the new algorithm solves the generalized problem while 
maintaining the favorable computational and statistical properties of the I2 algorithm. The re- 
sults are demonstrated on both real and synthetic data in two settings: fitting count data using 
negative Poisson log-likelihood loss, and fitting robust isotonic regressions using Huber's loss. 

Keywords: isotonic regression, nonparametric regression, regularization path, convex optimization 



1 Introduction 

In this paper, we generalize recently developed algorithms for solving large-scale isotonic regres- 
sions with l 2 loss function (Spouge et al. 2003, Luss et al. 2012) in order to handle a more general 
class of loss functions. These generalizations allow for fitting isotonic regressions with useful loss 
functions such as Huber's loss, which was previously impractical for large problems using generic 
convex optimization solvers. For example, isotonic regression with Huber's loss can be solved 
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with generic quadratic programming solvers that suffer due to the large number of constraints in 
our problems, whereas the algorithm we introduce takes advantage of the structured constraints 
and is more efficient by orders of magnitude. Isotonic regression is a nonparametric approach 
for building models whose fits are monotone in their covariates. Such assumptions are natural to 
applications in biology (Obozinski et al. 2008), ranking (Zheng et al. 2008), medicine (Schell & 
Singh 1997), statistics (Barlow & Brunk 1972) and psychology (Kruskal 1964). Assume n data 
observations (xi,yi), (x n ,y n ) and a partial order e.g., the standard Euclidean one where 
x\ -< %2 if and only if x±j < %2j coordinate-wise. We index the set of isotonicity constraints 
implied by the partial order by X = {(i, j) : Xj ^ Xj}. Classic isotonic regression considers the I2 
loss function and solves 



in y E R n . Throughout the paper we denote by m = \I\ the number of isotonic constraints and d 
the dimension of data, i.e., Xi G R d . 

While the assumption of isotonicity is often natural, isotonic regression has not been exten- 
sively applied in "modern" applications for two main reasons. As the number of observations n, 
the data dimensionality d, and the number of isotonicity constraints m increase, problem © suf- 
fers from computational as well statistical (i.e., overfitting) difficulties. These are reviewed in Luss 
et al. (2012), where it is argued that the computational difficulties can be overcome using modern 
algorithms, while overfitting can be addressed by regularizing the problem in ©, i.e., fitting "less 
complex" isotonic models than the optimal solution of (OQ). The Isotonic Recursive Partitioning 
(IRP) algorithm proposed in Spouge et al. (2003) and Luss et al. (2012) (following previous re- 
lated work by Maxwell & Muckstadt (1985) and Roundy (1986), among others) can easily solve 
problems with tens of thousands of observations, and is based on recursive partitioning of the 
covariate space and constructing isotonic models of increasing complexity, thus generating a regu- 
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larization path of isotonic models in the sense that isotonic models along the path are regularized 
by the number of partitions made. 

In this paper, we focus on a more general form of isotonic regression that minimizes a convex 
loss function subject to the isotonicity constraints, i.e., we solve 

mm {/(y) : y { < fa V(z,j) G X} (2) 

where / : R n — > R is a separable function such that 

n 

/(!/) = X>(fc), (3) 

i=l 



and fi : R — > R is differentiable and convex for all % = 1, . . . ,n. Typically, fiijji) = g(y~i,yi) 
measures the fit of yi to the observed response. Table [Dprovides several examples for the functions 
/*(•) that define /(•) above. 



p-norm loss, 1 < p < 2 


{rn - y t ) p 


5-Huber loss 


{Vi - Vi) 2 /^ for \yi - yi\ < 5 and S(\yi - yi\ - 5/2) otherwise 


Negative Poisson log-likelihood 


Vi ~ Vi ln (Vi) 


Negative Bernoulli log-likelihood 


-Vi In {Vi) - (1 - Vi) In (1 - fa) 



Table 1 : Examples of loss functions solvable by Algorithm [TJ 



The notion of generalized isotonic regression is not new. Barlow & Brunk (1972) defined a 
generalized loss function as 

n 

f{y) = ^2^{yi)-yiVi (4) 

i=l 

for proper convex $ : R — > R which was minimized subject to the same isotonicity constraints as 
in ©. They showed that this generalized isotonic regression problem can be solved equivalently as 
an instance of the l 2 isotonic regression ©. This implies that any large-scale algorithm for problem 
(QQ) can be used to solve isotonic regressions with objectives of the form ©. Generalized objectives 
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for large-scale Poisson and Bernoulli regressions as given in TableCQcan be solved in this manner, 
however the p-norm and Huber loss functions cannot. This relationship will be further formalized 
in Section l2~4l Generalized isotonic regression (using separable loss functions) in d = 1 dimension 
was also considered in Best et al. (2000) and Ahuja & Orlin (2001) using extensions of the pooled 
adjacent violators algorithm (PAVA). Neither assumes differentiability as done here, and hence 
both are amenable to a broader class of loss functions, albeit only in one dimension. Hochbaum 
& Queyranne (2003) offer a very efficient and related algorithm for problem © where the fits 
are restricted to being integer (called the convex cost closure problem). Their algorithm can be 
extended to the continuous case in the sense of determining an e-accurate solution by solving the 
integer problem on an e-grid. Depending on the required level of accuracy, their approach can be 
computationally competitive with our algorithm as described below for finding the solution of ©, 
however it lacks the natural statistical interpretation as a regularization path which our approach 
affords. The method of Hochbaum & Queyranne (2003) is discussed in more detail in Section |2~4| 
The main contribution of this paper is a generalization of IRP that can be used to solve large- 
scale multivariate generalized isotonic regressions of the form ©. Our generalization extends the 
methods to any convex differentiable loss function, including those mentioned in Tabled] and we 
term it generalized isotonic recursive partitioning (GIRP). As with IRP, the partitioning algorithm 
here addresses both of the main difficulties with isotonic regression discussed above. Firstly, it pro- 
vides a sequence of isotonic models of increasing complexity, converging to the globally optimal 
generalized isotonic solution. Early stopping along this "regularization path" is a useful approach 
to overcome overfitting concerns of the globally optimal solution; less complex isotonic models 
along the path often predict more accurately than the final overfit model. Secondly, it is computa- 
tionally efficient; the partitioning algorithm is an iterative scheme in which each iteration partitions 
a group of observations by solving a structured linear program for which very efficient algorithms 
exist. 

It should be emphasized that, while the algorithmic modification from IRP to GIRP is quite 
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minor and the proofs for GIRP properties are closely related to proofs of same results for IRP, the 
generalization has important practical implications because it significantly expands the range of 
applications for isotonic modeling, as our examples below illustrate. 

The paper continues as follows. Section |2] describes the known results for l 2 isotonic re- 
gression and generalizes them to the class of convex loss functions described by ®. The gen- 
eralized algorithm is described, and the relationship to Barlow & Brunk (1972) is formalized. 
Section [3] applies the results with Poisson log-likelihood and Huber's loss functions to synthetic 
and real data sets. A Matlab-based software package implementing our results is available at 
www .tau.ac. il/ ~ saharon/ files /GIRPvl . zip[ We first define terminology to be used 
throughout the remainder of the paper. 

1.1 Definitions 

Let V = {x±, . . . , x n } be the covariate vectors for n training points where Xi G M. d and denote 
Ui G R as the i th observed response. We will refer to a general subset of points A C V with no 
holes (i.e., x ■< y ^ z and x, z G A =>■ y G A) as a group. Throughout the paper, we will use the 
shorthand i G A = {i : Xi G A}. Denote by \A\ the cardinality of group A. The weight of a group 
A is denoted by 



For two groups A and B, we denote A ^ B if 3x G A, y G B such that x < y and $x G A,y G B 
such that y -< x (i.e., there is at least one comparable pair of points that satisfy the direction of 
isotonicity). A set of groups V is called isotonic if A ^ B =>- wa < wb, VA, B G V. A subset £ 
(U) of A is a lower set (upper set) ofAiixEA,y££,x~<y^>xe£(xeL(,y£A,x~< 
y =3- y G U). A group B C A is defined as a block of group A if w^n^ < % for each upper set 
U of A such that W fl B ^ {} (or equivalently if wcnB > for each lower set £ of A such that 
£ D B ^ {}). We denote the optimal solution for minimizing f(y) in the variable y by y*, i.e., 




(5) 
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y* = argmin f(y). The notation (df(y)/dy) \ y denotes the derivative of a function / with respect 
to the variable y evaluated at the point y. 

2 Generalized Isotonic Recursive Partitioning with Convex Loss 
Functions 

In this section, we generalize the results for / 2 isotonic partitioning resulting in an algorithm termed 
Generalized Isotonic Recursive Partitioning (GIRP), and derive useful properties of this general- 
ization. The solution at each iteration, as in IRP, is defined by groups that are proven to be the 
union of blocks in the optimal solution. Section [27T1 first gives an overview of the IRP algorithm. 
Section 12.21 then details the partitioning step for the generalized case and derives the resulting 
GIRP algorithm. When fi(y~i) = (yi — yi) 2 , that is, /(•) is the l 2 loss function, all results in this 
section replicate those of IRP and wa becomes the average of the observations in group A. Sec- 
tion [23] proves convergence of the partitioning algorithm to the global optimal solution of © and 
shows that the solution at each iteration of the algorithm is isotonic, i.e., the iterations provide a 
regularization path of isotonic solutions. 

2.1 Isotonic Recursive Partitioning with the ^ Loss Function 

We here briefly review the ideas of the IRP algorithm of Spouge et al. (2003) and Luss et al. 
(2012). The optimal solution to the Z 2 isotonic regression problem (Q3 is known to be defined by 
a partitioning of the observations yi into blocks in which yi = y,j if observations yi and yj are in 
the same block. Indeed, this structure can be seen through the optimality conditions (i.e., Karush- 
Kuhn-Tucker (KKT) conditions, see Boyd & Vandenberghe (2004)) for problem (H). The optimal 
solution to (OQ), denoted by y*, satisfies these conditions, which are given by 

(a) 2{y* - Vi ) = Er.u,i)ex ~ Ei^ec A *i V * e V 
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(b) y*<y* V(i,j)el 



(c) A*- > V(i,i)el 



(d) \; i (y*-y*) = o V(i,j)Gl, 



where A*- is the optimal dual variable associated with isotonicity constraint < Convexity 
of the l 2 loss function implies that any solution satisfying conditions (a)-(d) is a globally optimal 
solution. From condition (d), A*, > =>- y* — y*, i.e., the optimal solution is made of blocks V 
where X*j > for all i, j E V, which implies that all observations within a block V are fit to the 
same value. Furthermore, when restricting all fits within a block V to be equivalent, the isotonic 
regression problem over block V is the following unconstrained quadratic program 



which is trivially solved at y* = y v where y v denotes the average of all observations in block V. 
Condition (b) implies that these averages must satisfy isotonicity, i.e., if V~ ^ V + are isotonic 
blocks then y v - < y v + . Thus, the structure of the optimal solution is a partitioning of the set 
{1, . . . , n} into some (unknown number) K blocks {Vi, . . . ,Vk} where y* = y Vk for all i e Vk 
andy* < y* for all el. 

Many such feasible partitions exist that are not optimal (e.g., set all fits to the average of all 
observations). Condition (a) above must also be satisfied and gives the motivation for how to 
partition the set of observations in a manner that leads to the optimal partitioning. The partitioning 
scheme is detailed for the general case of convex loss functions in the next subsection. Here, we 
only give a general idea of how IRP works. 

IRP starts with the entire dataset as one group V and iteratively splits it into an increasing 
number of groups, until the optimal solution of (0Q> is reached. At each iteration, the algorithm 
chooses a sub-optimal group and partitions it into two groups by solving a specially structured 




(6) 
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linear program, detailed in the next subsection, that is amenable to very efficient algorithms. If 
the partition puts all observations into one group, it can be shown that the group is a block, i.e., 
optimal. Otherwise, the fits in the two resulting groups are recalculated as their averages (via © 
above), while the rest of the groups and their fits remain at their values in the previous iteration. 
IRP is thus an iterative scheme that splits a group at each iteration and never merges two groups 
back together; therefore, IRP is referred to as a no-regret partitioning algorithm. 

Two important theorems are proven (Luss et al. 2012) with respect to IRP. The first states that 
the new solution obtained after each partitioning step still satisfies isotonicity. After iteration K, 
there are K + 1 groups, Vi, . . . , Vk+i, in the partitioning with fit to y Vk for all i E with 
k G {1, . . . , K + 1}. The theorem thus says, that at each iteration K, the fits from the partition- 
ing Vi, . . . , Vk+i provide a potential isotonic prediction model. The second theorem shows that 
the IRP scheme terminates at the globally optimal partitioning. Hence, IRP produces a path of 
increasingly complex (since each iteration adds a partition) isotonic solutions, terminating in the 
optimal solution of ([I]). These theorems are made possible because of the particular splitting cri- 
terion used in the IRP algorithm, which is amenable to efficient calculation as mentioned above. 
The generalized version of the splitting criterion and the resulting algorithm are discussed next. 

2.2 The partitioning algorithm 

As with IRP, we solve a sequence of subproblems in order to solve the generalized isotonic re- 
gression problem ©; each subproblem divides a group of observations into two groups at each 
iteration. An important property of IRP with the l 2 loss function is that observations separated at 
one iteration remain separated at all future iterations. The same property applies here and implies 
that the total number of iterations is bounded by the number of observations n. 

The partitioning algorithm is motivated by the optimality conditions for the generalized isotonic 
regression problem ©. The optimal solution to ©, denoted by y*, are identical to conditions (a)- 
(d) in Section [2TT1 above, with the exception that condition (a) now has the generalized form 
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(a) 



dfidi) I 
din I y* 



where again \*- is the optimal dual variable associated with isotonicity constraint tji < yj. Con- 
vexity of the loss function again implies that any solution satisfying the optimality conditions is a 
globally optimal solution. The structure of the optimal solution as a partitioning of isotonic blocks 
can be seen from the KKT conditions as described in Section 12.11 Within each block, the fit to 
each observation for the general case is taken to be the weight of the observations in the block as 
defined by ©. Isotonicity of the two blocks V~ and V + , i.e., V~ ^ V + , means that w v - < w v +. 
From condition (a), summing over all observations in a block V, i.e., optimal group, gives 



Derivation of the partitioning step is as follows. Consider a group V where y* = wy for all 
i EV.lfV is an optimal group, it is a block and must satisfy ©. If it is not optimal, however, we 
can find a partitioning of V into two isotonic groups V + and V' such that 



The first summation over i E V + is the change in the objective value of problem © due to an 
increase in the fits of observations in V + . The second summation over i E V~ is the change in the 
objective value due to a decrease in the fits of observations in V~. Such a partition thus means that 
increasing the fits in V + to be greater than wy while decreasing the fits in V~ to be less than wy 
(which by definition maintains isotonicity of the fits) will cause an overall decrease in the objective 
value to problem ©. Fits that decrease the overall objective value can be achieved by fitting the 
observations in V + and V~ to their respective weights w v + and w v - . Hence, we search for an 
isotonic partitioning of V into V + and V~ that minimizes the lefthand term in ©. 



Denote by C v = {(V~, V+) : V~, V+ C V, V~ U V+ = V, V~ n V+ = {}, fix E V~ , y E 




(7) 




(8) 
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V + s.t. y ■< x} the set of all feasible (i.e., isotonic) partitions defined by observations in V. 
Partitioning is referred to as making a cut through the variable space (hence the optimal partition 
is made by an optimal cut). The optimal cut is determined as the partition that solves the problem 



min { y 



dfi{yi 



dm 



- E 

W V i€V - 



dfiiHi) 



dm 



} 



(9) 



where V (V + ) is the group on the lower (upper) side of the edges of the cut. The optimal cut 
problem © can be expressed as the binary program 



mm 



in {}] Xj 



dfi(m 



:xi<Xj V(i,j) el,Xi e {-l,+l} VieV}. 



(10) 



wi- 



lt is well-known (Murty 1983) that the continuous relaxation to this binary program (i.e., replacing 
the constraints X{ E { — 1,+1} by — 1 < x.- t < 1 for alH 6 V) is solved on the boundary of 
the feasible region with x* E {—1, +1} for all z = 1 . . . n. Thus the optimal cut problem © is 
equivalent to solving the linear program 



min{,2 T a; : Xi < Xj V(i, j) E X, — 1 < Xi < 1 Vi E V} 



(ID 



where Zj = (dfi(m)/dyi)\w v - Problem (fTTI) with Zi = 2(y v — yi) gives the linear program used 
to make partitions in IRP with l 2 loss function as described above in Section 12.11 As seen by 
property d8), a property of this optimal cut for generalized isotonic regression is that the sum of loss 
functions with Xj = +1 (Xj = —1) can be decreased by increasing (decreasing) the corresponding 
fits. That is, by increased (decreasing) w v for observations i with x^ = +1 (Xi = —1), the total 
change in loss is decreased, i.e., 



E 



dfi{m 



{i-.Xi=+l} 



dm 



Wy 



< and 

{i-.Xi=—l} 



dfi{m 



> 0. 



(12) 
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This group-wise partitioning operation is the basis for our algorithm which is detailed in Al- 
gorithm \T\ The algorithm differs from IRP only in Step 6 (they are obviously identical when /(•) 
is the Z 2 loss). Initially, all observations are in one group. Each iteration splits a group optimally 
by solving subproblem <TTTT > . A list C of potential optimal cuts for each group generated thus far 
is maintained, and, at each iteration, the cut among them with the smallest (most negative) objec- 
tive value is performed. Partitioning of a group ends when the solution to (fTTT) is trivial (i.e., no 
split is found because the group is a block). The algorithm stops when no further groups can be 
partitioned. 



Algorithm 1 Generalized Isotonic Recursive Partitioning 



Require: Observations (xi, yi), . . . , (x n , y n ) and partial order X. 

Require: k = 0, A = . . . , x n }},C = {(0, {x lt x n }, {})},£ = {}, M = (A, w A ). 
l: while .4 ^ {} do 

Let (val, w~, w + ) E C be the potential partition with largest val. 
Update A=(A \ (w~ U w + )) U {w~, w + }, C = C\ (val, w~, w + ). 
M k = (A,y A ). 
for all v G w + } \ {} do 

Set a = I Wiev where w v is the weight (DJ) of the observations in v. 

1 dyj \w v " ° 

Solve LP ([Til) with input z and get z* = argmin LP lFITj) . 
if z\ — . . . = z* (group is optimally divided) then 

Update A = A \ v and B = B U {v}. 
else 

Let v~ = {xi : z* = -1}, v + = {xi : z* = +1}. 
Update C = CU {(c T z*, v~, v+)} 
end if 
end for 
k=k+l. 
end while 

return Ai, a sequence of isotonic models, where M k contains the k th iteration's partitioning 
of observations and corresponding group weights. 
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Each iteration k of Algorithm [Qproduces a model M k by fitting each group in M k to its weight. 
For a set of groups V = {Vi, . . . , V k }, denote w v = {w Vl , ■ ■ ■ , w Vk }- Then model M k = (V, w v ) 
contains the partitioning V as well as a fit to each of the observations, which is the weight, as 
defined by ©, of the group it belongs to in the partition. 
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2.3 Properties of the partitioning algorithm 

So far, we have detailed the partitioning algorithm which is based on iteratively solving problem 
(fTTI) . but we have not yet shown that partitioning according to this particular scheme, i.e., solving 
problem (fTTI) . optimally solves the generalized isotonic regression problem. Theorem [TJnext states 
the main result that implies Algorithm [Jj is a no-regret partitioning algorithm for © (no-regret in 
the same sense as described for IRP in Section |2TI) . In the case of l 2 isotonic regression, this result 
is already known (Maxwell & Muckstadt 1985, Spouge et al. 2003, Luss et al. 2012). This theorem 
leads to our convergence result. The proof requires straightforward changes to the proof in Luss 
et al. (2012) based on the definition of convexity, the new algorithm cut in (fTTI) . and its properties 
(p"2~J) : the proof is thus left to the Appendix. 

Theorem 1 Assume group V is the union of blocks from the optimal solution to problem ([2]). Then 
a cut made by solving ( 1771) at a particular iteration does not cut through any block in the global 
optimal solution. 

The case of multiple observations at the same coordinates can be disregarded. Let V denote a 
set of groups where each group in V contains observations with the same coordinates, i.e V G V 
denotes the indices of multiple observations and \V\ — 1 means that V is a single observation. 
Then, we define gv{yv) = J2iev fi(vv) where yy G R and modify /(■) in the generalized isotonic 
regression problem © to be f(y) = Ylv&v 9v{yv) where y G R' v ' and each function gv{-) 
satisfies the necessary properties for applying GIRP. 

Since Algorithm [TJ starts with the union of all blocks for the first partition, we can conclude 
from this theorem that Algorithm [TJnever cuts a block when generating partitions. From the deriva- 
tion of the partitioning problem, it is clear that if an isotonic partition can be made, it will be made; 
that is, the algorithm will not stop early. Convergence of Algorithm [TJ to the global isotonic solu- 
tion with no regret then follows by repeatedly applying Theorem [TJ until all blocks of the optimal 
solution are identified. The next theorem states that AlgorithmfUprovides isotonic solutions at each 
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iteration. This result implies that the path of solutions generated by Algorithm Q] can be regarded 
as a regularization path for the generalized isotonic regression problem ©. Proof of this theorem 
is again held until the Appendix for the same reasons given above. 

Theorem 2 Model Mk generated after iteration k ofAlgorithm\J}is in the class of isotonic models. 

Complexity analysis of Algorithm [Qdepends on the number of observations n and isotonic con- 
straints m, and the complexity of solving linear program (TTTT >. Firstly, we assume that computing 
the weight of a group V via © requires computationally less effort than solving problem (fTTI) (in 
practice these problems are one-dimensional convex minimization problems that are easily solved 
with a binary search). In short, linear program (ITD is dual to a linear maximum flow network prob- 
lem (Ahuja et al. 1993), which is a well-studied problem. It can be solved in 0(mn log n) (Sleator 
& Tarjan 1983) or 0(n 3 ) (Galil & Naamad 1980) in the general case that we consider; special 
cases such as n = 2 or where the observations lie on a grid can be computed even faster (Spouge 
et al. 2003). Choice of algorithm depends on m which is 0(n 2 ) in the worst case. Given GIRP 
requires at most n iterations, this leads to worst case complexities of 0(mn 2 logn) or 0(n A ). A 
recent problem reduction by Stout (2010) can be used to obtain an equivalent representation of the 
desired problem with d-dimensional data and 0(n log d_1 n) constraints and observations, which 
can be useful when m is large. Finally, Luss et al. (2012) show that IRP performs in 0(Cn 3 ) in 
practice, where C is a function of the fraction of observations on each of the cut at each iteration. 
The same result applies here. 

2.4 Relations to Other Generalized Isotonic Regressions 

We here formalize the relationship between GIRP and the work of Barlow & Brunk (1972), which 
was hinted at in Luss et al. (2012) and mentioned in the introduction above. The generalized 
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isotonic regression problem of Barlow & Brunk (1972) is of the form 

n 

min - ViVi Vi < Vj V(i, j) G 1} (13) 

i=i 

in y G R" where we have left out weights for simplicity. While they allow $ : R — > R to be 
nondifferentiable, we consider here only the differentiable case and denote (/>(•) as the derivative of 
$(•). Let z* be the solution of © (i.e., h isotonic regression) with given observations yi. Theorem 
3.1 of Barlow & Brunk (1972) claims that the solution to (fTSl can be obtained as 

y* = d4) 

where is defined by <p~ l (4>(x)) = x for all x G R. Thus, any objective of the form (fTSl 

can be solved by computing the / 2 isotonic regression on input observations and then transforming 
the solution using (IT4|) . In this manner, IRP can be used to solved the somewhat limited class 
of generalized isotonic regression problems defined in Barlow & Brunk (1972) (note that without 
requiring $(•) proper, their theory would apply to any convex loss function). However, it is also 
clear that Algorithm [T] provides the tools for solving more general isotonic regression problems 
than (fT3l . e.g., as in the case for the p-norm or Huber's loss function. 

The same transformation from Barlow & Brunk (1972) can be used to derive an isotonic regu- 
larization path for generalized isotonic regression problems with the structure of (fTSl . Indeed, this 
can be shown using the above framework for our general isotonic regression problem ©, and is 
formalized in Proposition |3] 

Proposition 3 Problem rfTil) can be solved either by 

1. Applying IRP to the observation data y to obtain z* and tranforming using H4\) . 

2. Applying GIRP directly to rf7?1) . 

Furthermore, both algorithms are equivalent when applied to f fiil) in the sense that the regulariza- 
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tion path of partitions for each algorithm are equivalent. 

Proof. IRP can be used to solve the l 2 isotonic regression problem to obtain z*. Application of 
Theorem 3.1 of Barlow & Brunk (1972) gives the solution to (TT31 via (fl4l) . In order to apply GIRP, 
let fi(yi) = - yM and (d / = - y< where </»(■) denotes the derivative of 

$(■). Then roy as defined by © satisfies £\ ((j)(w v ) — y^) — giving uy = where y v 

denotes the mean observation over group V and is defined by = x for all igR. 

Hence, here (dfi(yi) /dy~i)\ Wv =y v ~ Hi an d the GIRP partitioning problem (fTTb is equivalent to 
the corresponding partition problem in Luss et al. (2012). ■ 

The only difference between IRP and GIRP for solving problem (fT3l is that GIRP fits obser- 
vations to the transformed isotonic regression fits wy = _1 (y V /) along the path, while IRP fits 
observations to the mean of the group's observations and the transformation is done on the final 
optimal partitioning. It is easy to see that the transformation (fl4)) can be applied to each iteration 
of IRP along the path in order to obtain an equivalent path to that of GIRP. 

This connection also relates the algorithm of Maxwell & Muckstadt (1985), which solves prob- 
lem ( JT3]) with = to IRP. Due to the analysis here, these algorithms are actually equiv- 
alent. Both the problem of Maxwell & Muckstadt (1985) and l 2 isotonic regression are specific 
instances of the more general problem © solved in this paper. It should be noted that Maxwell 
& Muckstadt (1985) did not make use of, or even recognize, the regularization path which plays a 
significant role for isotonic regression in dimension d > 1. 

Lastly, Hochbaum & Queyranne (2003) offer another partitioning algorithm for problem © 
with additional integer constraints. GIRP, in the continuous case, solves cut problem © because 
we know the fit within optimal groups (i.e., the weight of the group). Rather, in the integer case, 
the cut problem ® is solved instead with the derivatives evaluated at some a taken as the median 
of an interval in which the optimal fits lie. A theorem states that this partition problem divides the 
group into two groups V~ and V + , where optimal fits to observations in V~ are less than a and 
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optimal fits to observations in V + are greater than a. The problem is thus stated as determining 
a sequence aii, . . . , a t such that observations with optimal fits in the interval (a i7 a i+ i] have the 
same optimal fit. Given a criterion for determining when a is a breakpoint in this sequence, their 
algorithm can do better than a binary search. In fact, they further suggest a method that has a worst- 
case complexity equivalent to solving three max-flow problems. The complexity comes from using 
the information in previous max-flow problems to start new max-flow problems. A similar idea 
could possibly be applied in our continuous case where the search for breakpoints uses the group 
weight in the cut problem. This highly efficient algorithm does not provide the exact solution to 
the continuous case, but a regularization path based on the bounds they get when searching for 
breakpoints can be considered for the integer case, and in turn, for the problem on an e-grid. 

2.5 Regularization by recursive partitioning 

GIRP obtains the solution to problem (0 by recursively partitioning the covariate space into pro- 
gressively smaller regions and fitting the best constant in each region, referred to here as the weight 
which is defined by ©. As such, it is natural to think of the resulting sequence of models created 
from early stopping as a regularization path of models of increasing complexity, indexed by the 
number of iterations of the algorithm. Other examples of using early stopping for regularization 
include training neural networks with back propagation (Caruana et al. 2000) and boosting (Rosset 
et al. 2004). Extensive experience of the usefulness of regularization in high dimensional fitting 
(Wahba 1990, Tibshirani 1996, Scholkopf & Smola 2001), and especially in nonparametric mod- 
els like isotonic regression, suggests that regularization, embodied in this case by early stopping 
of the algorithm, can lead to reduced overfitting and hence improved predictive performance. As 
Theorem Vindicates, when stopping early and fitting the weight to each region, we are guaranteed 
to obtain a feasible isotonic model. 

While GIRP uses early stopping for regularization of the globally optimal isotonic model, 
we note that regularization commonly refers to learning a model by explicitly constraining the 
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family of models that are considered, and optimizing over this family. Early stopping after the 
k th iteration of GIRP produces an isotonic model with k cuts obtained through a sequence of local 
optimization problems. However, this model is not the solution to any global optimization problem. 
The k th model of GIRP is thus only one potential model with k cuts that satisfies the isotonicity 
constraints. One might, for example, seek a regularized isotonic model that minimizes loss subject 
to the isotonicity constraints such that exactly k cuts are made. The k th model in this case has a 
clear interpretation and more flexibility than the k th GIRP model. While this would certainly be an 
interesting problem to consider, it is combinatorially difficult and the authors do not know of any 
efficient methods for solving it. 

In Luss et al. (2012), model complexity of l 2 isotonic regression along the IRP path is quantified 
through the concept of equivalent degrees of freedom (DFs) as defined by Efron (1986) and Hastie 
et al. (2001). The initial iterations of IRP are shown to perform much more fitting than later 
iterations, and this phenomenon becomes more pronounced as the dimension d increases. For 
example, in dimension 6, often 50% of DFs were fitted by the first IRP iteration. Although the 
model complexity and DF measures of Efron (1986) do not generalize to non-Z 2 loss as used in 
GIRP, the general spirit of this result should persist. Intuitively, because the space of isotonic splits 
of the entire covariate space probed in the first iteration is much larger than the space of possible 
isotonic cuts in further iterations, finding the optimal first split corresponds to a significant portion 
of all fitting. 

These two effects — importance of early stopping coupled with the high portion of fitting in 
earlier iterations — are demonstrated empirically in the experiments of the next section, where the 
best performing solution along the GIRP path is compared to the optimal solution of problem © 
in terms of predictive performance. 
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3 Performance evaluation 



We here demonstrate usefulness of the partitioning algorithm for generalized loss functions. The 
contribution of our generalization is specifically illustrated by the use of Huber loss, which proves 
to be very effective in the case of outliers. We first exhibit the computational performance of GIRP 
and show that the algorithm can be applied to large-scale problems. We then consider synthetic 
data sets that demonstrate the impact of regularization and conclude with an example on real data. 

3.1 Practical Computational Performance 

Solving the multivariate isotonic regression problem with general loss functions such as Huber's 
loss was previously a computationally difficult problem. For certain loss functions, the isotonic 
regression problem can be reformulated and solved with off-the-shelf convex optimization solvers. 
For example, isotonic regression with Huber's loss can be reformulated as a quadratic program by 
adding many variables to the optimization problem. Simulations with 1000 training points were 
solved in 2.3 seconds with GIRP versus 135 seconds using Mosek (MOSEK ApS 2011) to solve 
the quadratic program (averaged over 50 simulations). This simple experiment demonstrates that 
GIRP, which is specifically designed for isotonic regression problems, is clearly a much more 
practical tool than using off-the-shelf generic solvers and makes generalized isotonic regression 
problems amenable to large-scale problems. 

Figure Q] (left) illustrates that GIRP can solve large-scale problems with Huber's loss. The i th 
observation in each simulation is generated as = (Y\ x^) + Af(0,d 2 ) with Xij ~ W[0,2], d 
representing the dimension, and outliers randomly inserted. Results are averages over 50 simu- 
lations. Isotonic regression in 8 dimensions with 20,000 training instances is solved in less than 
one minute. Figure CD (right) shows the number of partitions that GIRP performs on average for 
varying dimensions. More training data and higher dimension typically implies more complex 
isotonic models, resulting in more partitioning problems and more computational time. The com- 
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putational limitation of training the isotonic model with GIRP is solving the partitioning problem. 



Luss et al. (2010) further offers a heuristic for solving the partition problem that makes training 
isotonic regression problems with up to 200,000 training instances easily feasible. 
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Figure 1: Left: Computational performance of training isotonic models with GIRP on a simulation with 
Huber's loss for varying number of dimensions and training instances. Right: Complexity of isotonic models 
as measured by the number of partitions for varying number of dimensions and training instances. 



3.2 Simulations 

Experiments are run on two different loss functions. In the first experiment, count data is sim- 
ulated from Poisson distributions where the average number of occurrences is generated by two 
different isotonic models. Generalized isotonic regression models for the Poisson rate are obtained 
by minimizing negative Poisson log-likelihood subject to isotonicity constraints. In the second set 
of experiments, observations are generated by two different isotonic models and .5% of the train- 
ing observations are multiplied by a large constant to make them outliers. Generalized isotonic 
regression models are obtained using 5-Huber loss. Note that Poisson isotonic regressions can be 
handled using IRP due to the theory of Barlow & Brunk (1972), while Huber isotonic regressions 
require using GIRP. 

Our experimental framework is as follows. A training and testing set are independently simu- 
lated by a fixed distribution. Training and testing sets have 15000 and 3000 observations, respec- 
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tively. A model is first generated on the training data. In the case of GIRP, the training data is 
split into a subtraining set of 12000 observations and a validation set of the remaining 3000 ob- 
servations. A path of isotonic models is generated by running GIRP on the subtraining data. The 
validation data is used to select the regularization level (stopping point), and the resulting model is 
applied to predict the testing data. With respect to parametric regression, e.g., Poisson and Huber 
regressions, models are trained on the full 15000 observation training set and tested on the 3000 
observation testing set. Results are based on averaging fifty simulations. 

The first two examples use Poisson negative log-likelihood as the loss function. Data for the 
two simulations is generated as x^ ~ U[0, 10] and x^ ~ U[5, 10] (the coordinates of x are drawn 
i.i.d in all our experiments), respectively. The i th observation in each simulation is generated 
as %ji ~ Poisson(^. y/Xi]) and ~ Poisson(^. x?-), respectively. The isotonic models are 
compared to the results of a Poisson regression, and performance here is measured by negative 
Poisson log-likelihood. The regularized model generated by the minimum loss along the GIRP 
curve (GIRP Min Poisson) is compared with the final GIRP model (GIRP Final Poisson) and 
with the Poisson regression model. In practice, one would only consider predictions using the 
regularized model, but here we want to compare against the unregularized model as well. Table |2] 
demonstrates that Poisson isotonic regression works well with a reasonable number of variables (2- 
5 for the first simulation and 2-3 for the second simulation), however is outperformed by the simple 
Poisson regression with more than 5 variables. In comparing the regularized GIRP model with the 
final GIRP model, there is no statistical difference in this example. The next simulation clearly 
exemplifies the effect of regularization, in addition to the use of generalized isotonic regression. 

The second two examples use 5-Huber loss as the loss function for generating models. Data for 
the two simulations is generated as ~ U[0, 3] and x^ ~ U[0, 5], respectively. The i th observa- 
tion in each simulation is generated as y { = {J\. Xij)+M(0, d 2 ) and y-i = . x?-)+jV(0, (1.5<i) 2 ), 
respectively, where d is the dimension. For a randomly chosen 0.5% of the training data, the ob- 
servations are multiplied by a factor of 20. The generalized isotonic models are compared to the 
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results of a Huber regression, and performance here is measured by mean squared error. Note that 
we assume that squared error loss represents the true objective performance; the models are fit 
using Huber loss in order to avoid sensitivity to outliers. The regularized model generated by the 
minimum loss along the GIRP curve (GIRP Min Huber) is compared with the final GIRP model 
(GIRP Final Huber) and with the Huber regression model. Table [3] demonstrates that Huber iso- 
tonic regression works well with a reasonable number of variables (2-5 for the first simulation 
and 2-4 for the second simulation), however, again, a simple Huber regression outperforms GIRP 
for higher dimensions due to overfitting. An important note here is the effect of regularization. 
The average loss using the unregularized isotonic model is not statistically superior at any dimen- 
sion to the average loss using a Huber regression while the regularized isotonic model produces 
statistically improved performance. 

Figures |2] and |3] display regularization paths for the Poisson and Huber simulations, respec- 
tively. Each curve shows the performance from using increasingly complex models generated by 
GIRP. Take, for example, the first curve (d = 2) under Model 1 in Figure El The x-axis states the 
number of partitions in the particular GIRP model and the y-axis measures the negative Poisson 
log-likelihood of using the GIRP models (trained on the subtraining data) to predict the validation 
data. As the number of partitions increases (i.e., as the model becomes more complex), perfor- 
mance improves. Consider next d = 5 under the same model. After 12 iterations of GIRP the 
performance begins to worsen (the minimum along each curve is shown by a diamond). This is 
exactly the effect of regularization. Performance improves as the model complexity increases up 
to a certain point at which increasing the complexity further overfits the model and performance 
declines. Thus, as done to obtain the performance in Tables |2] and |3j the model along the path that 
gives the best performance on the validation data is used to make predictions on the independent 
testing data. 

The curves in Figure |3] show similar paths for the generalized isotonic regressions with Huber 
loss where performance is measured by mean squared error. Here the effects of regularization are 
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much more pronounced than they are in the Poisson simulations. This suggests that robust re- 
gressions on applications where isotonicity is desired would greatly benefit from the regularization 
of GIRP with Huber loss. We next exhibit this robustness effect on a data set for predicting the 
miles-per-gallon of automobiles. 



Model 1: 2/i ~ Poisson (FT- y/xij) with x-ij ~ U[0, 10] 



Dim 


GIRP Min Poisson 
Neg. Log-Likelihood 


GIRP Final Poisson 
Neg. Log-Likelihood 


Poisson Regression 
Neg. Log-Likelihood 


Min 
Path 


GIRP 
Length 


2 


30678.65 (± 23.83) 


30678.73 (± 23.84) 


32031.44 (±24.22) 


217 


298 


3 


36395.90 (± 38.03) 


36397.99 (± 37.84) 


40421.91 (± 37.30) 


90 


618 


4 


43908.67 (± 54.93) 


43944.68 (± 56.47) 


54108.80 (± 60.77) 


78 


584 


5 


66812.67 (± 240.06) 


68096.06 (± 347.41) 


81096.57 (± 132.33) 


30 


371 


6 


200068.80 (± 1072.14) 


220308.20 (± 2059.38) 


140478.56 (± 398.51) 


9 


479 


Model 2: j/j ~ Poisson(^ ? . x?-) with ~ W[5, 10] 


Dim 


GIRP Min Poisson 
Neg. Log-Likelihood 


GIRP Final Poisson 
Neg. Log-Likelihood 


Poisson Regression 
Neg. Log-Likelihood 


Min 
Path 


GIRP 
Length 


2 


56957.75 (± 31.19) 


56957.77 (± 31.20) 


57802.85 (± 31.17) 


661 


794 


3 


60650.13 (± 30.83) 


60650.62 (± 31.32) 


60861.90 (± 29.95) 


105 


1239 


4 


64008.55 (± 40.17) 


64041.56 (±44.10) 


62956.84 (± 23.68) 


57 


1105 


5 


67837.30 (± 50.78) 


68182.67 (±78.18) 


64590.51 (± 30.33) 


16 


806 


6 


74438.33 (± 97.72) 


75479.73 (±91.07) 


65936.54 (± 28.06) 


16 


544 



Table 2: Statistics for count data simulations generated by two different models as labeled above. GIRP Min 
(Final) Poisson Neg. Log-Likelihood (LL) refers to the negative Poisson LL of predicting on independent 
testing data using the model that produced the minimum (final) loss along a regularization path generated on 
training data. Min Path is the number of partitions made to generate the minimum negative Poisson LL and 
GIRP Path Length is the number of partitions in the global generalized isotonic solution. Poisson Regression 
Neg. LL is the negative Poisson LL from using Poisson regressions. Bolded MSE values for minimum and 
final GIRP negative Poisson LL indicate that they are significantly lower than the negative LL of the Poisson 
regression at level .05. 



3.3 Predicting Miles-Per-Gallon 

The next example compares the l 2 versus Huber loss regressions. Note that Huber loss function 
is an example that cannot be solved using the theory of Barlow & Brunk (1972) and I2 isotonic 
regression. This example uses a data set of 392 automobiles (Frank & Asuncion 2010) and models 
miles-per-gallon using the following seven variables: origin, model year, number of cylinders, 
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Model 1: Vi = (J] 7 - x tj ) +N(0, d 2 ) with Xij ~ W[0, 3 




Dim 


GIRP Min Huber 
MSE 


GIRP Final Huber 
MSE 


Huber Regression 
MSE 


Min 
Path 


GIRP 
Length 


2 


4.21 (± 0.30) 


4.35 (± 0.36) 


4.55 (± 0.03) 


49 


421 


3 


9.69 (± 0.07) 


11.71 (± 2.44) 


13.18 (± 0.10) 


27 


1607 


4 


22.93 (± 0.26) 


90.83 (± 64.78) 


36.94 (± 0.51) 


10 


3645 


5 


83.20 (± 1.23) 


280.08 (± 106.02) 


115.47 (± 2.43) 


6 


5783 


6 


370.56 (± 10.41) 


2080.71 (± 915.59) 


391.01 (± 12.09) 


3 


7531 


Model 2: w = . x?-) + A/"(0, (1.5d) 2 ) with ~ W[0, 5] 


Dim 


Huber Min Huber 
MSE 


GIRP Final Huber 
MSE 


Huber Regression 
MSE 


Min 
Path 


GIRP 
Length 


2 


9.60 (± 0.07) 


14.12 (± 8.35) 


15.98 (± 0.11) 


57 


1154 


3 


23.79 (± 0.20) 


60.17 (± 34.96) 


30.61 (± 0.23) 


33 


3283 


4 


48.20 (± 0.41) 


193.37 (± 83.58) 


50.02 (± 0.36) 


16 


5705 


5 


85.62 (± 0.58) 


599.59 (± 298.06) 


73.44 (± 0.54) 


8 


7785 


6 


145.06 (± 1.34) 


1602.43 (± 620.45) 


101.12 (± 0.73) 


8 


9283 



Table 3: Statistics for count data simulations generated by two different models as labeled above. GIRP 
Min (Final) Huber MSE refers to the MSE of predicting on independent testing data using the model that 
produced the minimum (final) loss along a regularization path generated on training data. Min Path is the 
number of partitions made to generate the minimum MSE and GIRP Path Length is the number of partitions 
in the global generalized isotonic solution. Huber Regression MSE is the MSE from using Huber regressions. 

acceleration, displacement, horsepower, and weight. Isotonic regression with an l 2 loss function 
was already shown to be useful for this data set in Luss et al. (2012). In this experiment, we 
have modified one data point to be an outlier. This random point is chosen such that isotonicity 
constraints require its fit to be less than the fits of five other data points. The experiment simulates 
a real-life outlier problem which affects the training sample but should not affect prediction. We 
assume squared error loss to be the true prediction criterion (therefore the out-of-sample evaluation 
criterion), and fit models with Huber loss to avoid sensitivity to outliers. Table|4]displays the results 
of one random division of the data (2/3 for training, 1/3 for testing). A paired t-test comparing the 
out-of-sample predictive performance of the two models (IRP and GIRP) confirms the significant 
edge of the model generated with Huber's loss function in this setting. 
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Figure 2: Normalized negative Poisson log-likelihood (LL) for predictions of count data simulations with 
different dimensions d. Each path is normalized by the loss of the initial model. The x-axis in each fig- 
ure corresponds to the number of partitions made by GIRP, i.e., the curves show how the normalized 
negative Poisson LL of test data varies as the GIRP algorithm progresses. Model 1 uses the function 



yi ~ Poisson (n 



■ i j 



with Xij ~ U[0, 10] and Model 2 uses the function ?/, ~ Poisson(^ ■ x\ L) with 



Xij ~ U[0, 5]. Fifty simulations were run with 12000 training and 3000 testing points. Only the first few 
hundred partitions of the paths are displayed in order to make the loss of the earlier GIRP iterations visually 
clearer. Scales also differ in order to make the shapes of the curves clear. 
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Figure 3: Normalized MSE for predictions of two isotonic simulations with different dimensions d. Each 
path is normalized by the MSE of the initial model. The x-axis in each figure corresponds to the number 
of partitions made by GIRP, i.e., the curves show how the normalized MSE of test data varies as the GIRP 
algorithm progresses. Model 1 uses the function yi = (J"J . xij) + Af(0, d 2 ) with x\j ~ U[0, 10] and Model 
2 uses the function j/j = • xfj) + A/"(0, (1.5o?) 2 ) with ~ W[0, 5]. Fifty simulations were run with 
12000 training and 3000 testing points. Only the first few. 

4 Conclusion 

In this paper, we show how a relatively minor adjustment to the previously proposed IRP algorithm 
leads to a generalization allowing us to efficiently fit isotonic models under any convex differen- 
tiable loss function. Our proposed GIRP algorithm also generates regularized isotonic solutions 
along its path, in addition to the optimal isotonic solution. An important remaining challenge is 
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Number 


IRP LS 


GIRP Huber 


IRP LS 


IRP LS 


GIRP Huber 


GIRP Huber 


Variables 


Min MSE 


Min MSE 


Min Path 


Path Length 


Min Path 


Path Length 


1 


37.47 ± 9.67 


38.48 ± 10.42 


2 


3 


2 


3 


2 


31.22 ±7.04 


27.01 ± 6.28 


17 


17 


7 


16 


3 


21.19 ± 6.75 


15.83 ±4.76 


12 


30 


5 


30 


4 


22.90 ± 6.78 


15.53 ± 4.01 


4 


53 


11 


54 


5 


19.94 ± 7.06 


10.95 ± 3.15 


4 


78 


29 


81 


6 


17.55 ± 5.99 


9.78 ± 2.89 


4 


86 


69 


90 


7 


18.91 ± 6.29 


10.24 ± 3.51 


4 


95 


71 


95 



Table 4: Statistics for auto mpg data. Miles-per-gallon is regressed on a seven potential variables: ori- 
gin, model year, number of cylinders, acceleration, displacement, horsepower, and weight. A comparison 
between the results of IRP and GIRP with Huber loss is shown. One data point y, L such that X{ H Xj for 
j = 1 ... 5 was modified to be an outlier. 5 for Huber loss is set to one standard deviation of the training re- 
sponses. Bold demonstrates statistical significance with 95% confidence determined by a paired t-test using 
131 out-of-sample observed squared losses obtained from models trained on 261 in-sample observations. 

to generalize the approach to handling convex non-differentiable loss functions (like absolute loss 
or the hinge loss of support vector machines), an important topic for future research. Our analysis 
does not hold in this case due to nonuniqueness of the subproblems. 
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5 Appendix 



We need the following additional terminology: A group X majorizes (minorizes) another group Y 
if X y Y (X 2< Y). A group X is a majorant (minorant) of X U A where A = if X Ai 

(X ^ Ai) Vi = 1 . . . fc. 



Theorem [B 

Proof. We prove by contradiction. Assume there exists a union of K blocks in V in the optimal 
solution labeled Ai = M\ U . . . U Mk that get broken by the cut, with M\ and Mk as the minorant 
and majorant block in Ai, and and Mj/ as the groups in Mk below and above the cut. Define 
C as the union of all blocks in V that lie "below" the algorithm cut, hi as the union of all blocks 
in V that lie "above" the algorithm cut. Further define A K C C (A^ C hi) as the union of blocks 
along the algorithm cut such that A K y M K (A^ -< ). Figure |4] depicts an example of these 
definitions where A\ — A\ — A K = A K = {} for simplicity. 

We first prove that wm x > wv- First, consider the case A\ = {}. By convexity of /j(-) and 
summing over group M^ , we have 



Y fi( w M?) ^ Y ^ Wv ^ + ( w m? ~ w v) Y 



dfi(yi 



Definition of the weight operator gives 



Y fi( W Mu) < Y fi( W v) ( W MU ~ W V ) Y 



dfiiVi) 



dm 



< 0. 



Finally, by the definition of the algorithm cut in ([Til since no block exists below M^ 7 to affect 
isotonicity, 



9jm 

dm 



< o 



(15) 



Wv 
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so that w M u > Wy. Since M\ is a block, we have w m l > w M u, and then 



W m l > W M y > W V => W Ml > Wy- 



For the case, Af ^ {}, we have wmi > w A u > wy with the first inequality due to optimality 
and the second follows directly the proof above replacing Mf 7 by Af . A proof for wm k < wv 
follows a similar argument focusing on M^. Putting this together gives wm 1 > wv > wm k > 
which contradicts that Mi and M K are blocks in the global solution, since by assumption then 
w Ml < w Mk . The case K = 1 is also trivially covered by the above arguments. We conclude that 
the algorithm cannot cut any block. ■ 



Figure 4: Illustration of proof of Theorem [TJ Black lines separate blocks. The diagonal red line through the 
center demonstrates a cut of Algorithm [TJ £ is the union of blue blocks below the cut and U is the union of 
green blocks above the cut. White blocks are blocks that are potentially split by Algorithm [TJ These blocks 
are split into M[, M 5 L below the cut and Mf, . . . , above the cut. In the proof, M, = Mf U Mf 
Vi = 1 . . . 5. The proof shows, for example, that if the algorithm splits M\ into Mf and according to 
the defined cut inQT] then there must be no isotonicity violation when creating blocks from and M-P '. 
However, since M\ is assumed to be a block, there must exist an isotonicity violation between Mf and , 
providing a contradiction. 



U 




L 
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Proof. The proof is by induction. The base case, i.e., first iteration, where all points form one 
group is trivial. The first cut is made by solving linear program (fTTI) which constrains the solution 
to maintain isotonicity. 

Assuming that iteration k (and all previous iterations) provides an isotonic solution, we prove 
that iteration k + 1 must also maintain isotonicity. Figure |5]helps illustrate the situation described 
here. Let G be the group split at iteration k + 1 and denote A (B) as the group under (over) the 
cut. Let A = {X : X is a group at iteration k + 1, 3i E X such that (i, j) E I for some j E A} 
(i.e., X E A border A from below). 

Consider iteration k + 1. Denote X = {X E A : Wa < wx} (i.e., X E X violates isotonicity 
with A). The split in G causes the fit in nodes in A to decrease. Proof that 



follows the proof of (TT3T > in Theorem 1 above so that wa > wg- We will prove that when the fits 
in A decrease, there can be no groups below A that become violated by the new fits to A, i.e., the 
decreased fits in A cannot be such that X ^ {}. 

We first prove that X = {} by contradiction. Assume X ^ {}. Denote k < k + 1 as the 
iteration at which the last of the groups in X, denoted D, was split from G and suppose at iteration 
ko, G was part of a larger group H and D was part of a larger group F. It is important to note that 
X f)(F U H) = {} E X \ D at iteration k because by assumption all groups in X \ D were 
separated from A before iteration i. Thus, at iteration k , D is the only group bordering A that 
violates isotonicity. 

Let Du denote the union of D and all groups in F that majorize D. By construction, Djj 
is a majorant in F. Hence wd v < wfuh by Algorithm \T\ and wa < wd v by definition since 
wd v > wd > wa- Also by construction, any set X E H that minorizes A has wx < wa (each set 
X that minorizes A besides D such that wx < wa has already been split from A). Hence we can 
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denote A L as the union of A and all groups in H that minorize A and we have wa > wa l and A L 
is a minorant in H. Since A L C H at iteration i, we have 



which is a contradiction, and hence the assumption X ^ {} is false. The first inequality is because 
the algorithm left Al in H when F was split from H, and the remaining inequalities are due to the 
above discussion. Hence the split at iterations k + 1 could not have caused a break in isotonicity. 

A similar argument can be made to show that the increased fit for nodes in B does not cause 
any isotonic violation. The proof is hence completed by induction. ■ 



Figure 5: Illustration of proof of Theorem |2] showing the defined sets at iteration k + 1. G is the set divided 
at iteration k + 1 into A (all blue area) and B (all green area). The group bordering A from below denoted 
by X\ (also referred to as D in the proof) is in violation with A. At iteration ko, G is part of the larger group 
H and X\ is part of the larger group F. At iteration ko, groups F and H are. separated. The proof shows 
that when A and B are split at iteration k + 1, no group such as X\ where wx 1 > wa could have existed. 
In the picture, X\ must have been separated at an iteration ko < k + 1, but the proof, through contradiction, 
shows that this cannot occur. 



w FuH < w Al <w a < w Du < W FUH 
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